Visualizing Stylistic Variation

Authors

  • Jussi Karlgren
  • Troy Straszheim
Abstract

Texts vary not only by topic, but by style; indeed, often the variation between texts ‘about the same thing’ can be just as noticeable as the variation between texts ‘about different things’. Some facets of this variation are quite easy to detect, and quite predictable when applied to categorization of texts by genre, functional style, or, tentatively, quality. Making use of such variation in a retrieval context is quite straightforward in principle; our work consists of an implementation of a visualization tool for document databases. The issues addressed include 1) choice of stylistic items to investigate, 2) composition of dimensions of variation, and 3) judicious naming of dimensions for presentation. We use principal components analysis to combine our quite large number of stylistic items into the two most significant dimensions of variation and plot the document space under consideration onto a plane. This space can be used as a first or last filter in an information retrieval task. The composition of the most significant dimensions is naturally corpus dependent, as is the naming of them: our work is tested on Internet and TREC data.

1 Stylistic Variation

Texts vary not only by topic, but by style; indeed, often the variation between texts ‘about the same thing’ can be just as noticeable as the variation between texts ‘about different things’. Human readers process a multitude of stylistic markers, each of which taken separately is almost meaningless, to categorize texts into functional styles or genres, or to assess their position along some continuum of stylistic variation. Some markers of this type are quite easy to identify and compute. We are most interested in examining the stylistic variation based on the specific genres or functional styles (Vachek, 1975) that can be found in electronically published documents, as opposed to very subjective or situation-specific measures such as individual style or even writing quality.
Methods such as ours have previously been used, with some success, for authorship determination in cases where documents have unknown or disputed authors, and, with a lesser degree of success, for readability measurement of educational and mass-market reading materials. Conceivably, similar methods could be used for quality determination: determining which of two texts about the same subject in the same genre is the better text in some or any sense.

2 Text in Uniform Guise

Digital information technology has been vectored towards the production of information, and the publishing threshold for information has been lowered dramatically over the past few hundred years. By contrast, comparatively little work has been put into tools for the consumer. Indeed, many of the markers, such as paper quality, typesetting, and even spelling, that readers have previously been able to use to distinguish the New York Times from home-produced handouts have been neutralized through the advent of inexpensive proofreading tools and the World Wide Web. On the Internet the publishing threshold is very low, and the usefulness of the abundance is offset by the less than perspicuous variation in quality, provenance, and author intentions.

3 Aim of these experiments

This paper describes some experiments made as groundwork for building a tool which will display a set of texts as points on a plane, scattered according to stylistic criteria. We will not go into the experiments in every detail, but we will attempt to describe how we motivate the more important design choices we make. Our hypotheses are that there are important stylistic cues in electronically published texts; that these cues can be used for categorizing or sorting documents in an interactive information retrieval scenario; and that the stylistic variation can most handily be explained in terms of genres.
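As an illustration of how a set of texts might be placed as points on a plane, the following is a minimal sketch of principal components analysis, the technique named in the abstract. It is not the authors' implementation: the function names (`standardize`, `covariance`, `top_eigenpair`, `pca_plane`) are hypothetical, the choice to scale every stylistic item to unit variance is an assumption, and power iteration with deflation is used only to keep the example free of external libraries.

```python
import math


def standardize(rows):
    """Center each column and scale it to unit variance (an assumed preprocessing step)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1)) or 1.0
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, sds)] for r in rows]


def covariance(z):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenpair(c, iters=1000):
    """Dominant eigenvalue and eigenvector of a symmetric matrix by power iteration."""
    d = len(c)
    v = [1.0 + i for i in range(d)]  # arbitrary, non-uniform start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(c[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # no variance left in any direction
            return 0.0, v
        v = [x / norm for x in w]
    lam = sum(v[i] * sum(c[i][j] * v[j] for j in range(d)) for i in range(d))
    return lam, v


def pca_plane(rows):
    """Project each document's feature vector onto the two leading principal axes."""
    z = standardize(rows)
    c = covariance(z)
    d = len(c)
    axes = []
    for _ in range(2):
        lam, v = top_eigenpair(c)
        axes.append(v)
        # Deflation: remove the component just found so the next pass finds the following one.
        c = [[c[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return [tuple(sum(a * b for a, b in zip(r, axis)) for axis in axes) for r in z]


if __name__ == "__main__":
    # Made-up feature rows (one row per document), for illustration only.
    docs = [[1, 2.0, 0.5], [2, 4.1, 0.4], [3, 6.2, 0.9], [4, 7.9, 0.1], [5, 10.1, 0.6]]
    for x, y in pca_plane(docs):
        print(f"{x:+.3f} {y:+.3f}")
```

Each returned pair can be plotted directly as a point. Which stylistic items dominate each axis depends on the corpus, which is consistent with the abstract's remark that the composition and naming of the dimensions are corpus dependent.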
1060-3425/97 $10.00 (c) 1997 IEEE. Proceedings of The Thirtieth Annual Hawaii International Conference on System Sciences, ISBN 0-8186-7862-3/97 $17.00 © 1997 IEEE

Variable name   Statistic                                     Typical range
WORDS           Text length in words                          31-9228
TT              Type token ratio                              0.13-0.89
CPW             Average word length in characters             4.59-9.95
WPS             Average sentence length in words              2.45-63.1
P1              Proportion first person pronouns of words     0-105
P2              Proportion second person pronouns of words    0-20
P3              Proportion third person pronouns of words     0-60
IT              Proportion ‘it’ of words                      0-44
NT              Proportion contractions: I’ll, you’re, etc.   0-33

Table 1: Stylistic items under consideration
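The items in Table 1 are simple surface counts, and a rough sketch of how they might be computed is given here. This is an illustration under stated assumptions rather than the authors' procedure: the pronoun lists, the tokenizer, the sentence splitter, and the function name `stylistic_items` are all hypothetical, and the proportions are scaled per 1000 words only because that seems compatible with the ranges in the table; the excerpt does not state the unit.

```python
import re

# Assumed word lists; the source does not enumerate the pronoun forms it counts.
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}
SECOND_PERSON = {"you", "your", "yours", "yourself", "yourselves"}
THIRD_PERSON = {"he", "him", "his", "himself", "she", "her", "hers", "herself",
                "they", "them", "their", "theirs", "themselves"}

# A word is a run of letters, optionally followed by an apostrophe and more letters.
WORD_RE = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)?")


def stylistic_items(text, per=1000):
    """Return the Table 1 items for one text; `per` is an assumed scaling factor."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = WORD_RE.findall(text)
    lower = [w.lower() for w in words]
    n = len(words)
    if n == 0:
        raise ValueError("text contains no words")

    def rate(count):
        return per * count / n

    return {
        "WORDS": n,
        "TT": len(set(lower)) / n,               # distinct word forms / tokens
        "CPW": sum(len(w) for w in words) / n,   # apostrophes count as characters
        "WPS": n / len(sentences),
        "P1": rate(sum(w in FIRST_PERSON for w in lower)),
        "P2": rate(sum(w in SECOND_PERSON for w in lower)),
        "P3": rate(sum(w in THIRD_PERSON for w in lower)),
        "IT": rate(lower.count("it")),
        # Crude proxy: any word with an apostrophe, so possessives are counted too.
        "NT": rate(sum("'" in w or "’" in w for w in lower)),
    }


if __name__ == "__main__":
    for name, value in stylistic_items("I think it's fine. You like it!").items():
        print(name, value)
```

One such dictionary per document yields a feature vector; a collection of these vectors is the kind of input that principal components analysis would then reduce to two dimensions.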


Related resources

Automatic Analysis and Visualization of Stylistic Genres

In this paper we are going to present a method capable of visualizing and automatically determining similarity between stylistic genres in Swedish and English text documents. The method has also been implemented and tested with good results. It is based on linguistic methods and uses only the raw word content of documents, meaning that it requires no metatags or other features set by the author...


Stylistic Variation in an Information Retrieval Experiment

Texts exhibit considerable stylistic variation. This paper reports an experiment where a corpus of documents (N= 75 000) is analyzed using various simple stylistic metrics. A subset (n = 1000) of the corpus has been previously assessed to be relevant for answering given information retrieval queries. The experiment shows that this subset differs significantly from the rest of the corpus in term...


Modeling of Stylistic Variation in Social Media with Stretchy Patterns

In this paper we describe a novel feature discovery technique that can be used to model stylistic variation in sociolects. While structural features offer much in terms of expressive power over simpler features used more frequently in machine learning approaches to modeling linguistic variation, they frequently come at an excessive cost in terms of feature space size expansion. We propose a nov...


Stylistic Experiments for Information Retrieval (Jussi Karlgren, Swedish Institute of Computer Science)

Information retrieval systems are built to handle texts as topical items: texts are tabulated by occurrence frequencies of content words in them, under the assumption that text topic is reasonably well modeled by content word occurrence. But texts have several interesting characteristics beyond topic. The experiments described in this text investigate stylistic variation. Roughly put, style is ...


Textual Stylistic Variation: Choices, Genres and Individuals

This chapter argues for more informed target metrics for the statistical processing of stylistic variation in text collections. Much as operationalized relevance proved a useful goal to strive for in information retrieval, research in textual stylistics, whether application oriented or philologically inclined, needs goals formulated in terms of pertinence, relevance, and utility—notions that a...



Publication date: 1997